Abstract

In this paper we address the problem of pool based active learning, and provide an algorithm, called
UPAL, that works by minimizing an unbiased estimator of the risk of a hypothesis in a given hypothesis
space. For the space of linear classifiers and the squared loss we show that UPAL is equivalent to an
exponentially weighted average forecaster. Exploiting some recent results regarding the spectra of random
matrices allows us to establish consistency of UPAL when the true hypothesis is a linear hypothesis.
Empirical comparisons with an active learner implementation in Vowpal Wabbit, and with a previously proposed
pool based active learner implementation, show good empirical performance and better scalability.



1 Introduction 

In the problem of binary classification one has a distribution $\mathcal{D}$ on the domain $\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^d \times \{-1, +1\}$, and
access to a sampling oracle, which provides us i.i.d. labeled samples $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$. The task is
to learn a classifier h, which predicts well on unseen points. For certain problems the cost of obtaining labeled 
samples can be quite expensive. For instance consider the task of speech recognition. Labeling of speech 
utterances needs trained linguists, and can be a fairly tedious task. Similarly in information extraction, 
and in natural language processing one needs expert annotators to obtain labeled data, and gathering huge 
amounts of labeled data is not only tedious for the experts but also expensive. In such cases it is of interest 
to design learning algorithms, which need only a few labeled examples for training, and also guarantee good 
performance on unseen data. 

Suppose we are given a labeling oracle O, which when queried with an unlabeled point x returns the
label y of x. Active learning algorithms query this oracle as few times as possible and learn a provably good 
hypothesis from these labeled samples. Broadly speaking active learning (AL) algorithms can be classified 
into three kinds, namely membership query (MQ) based algorithms, stream based algorithms and pool based 
algorithms. All these three kinds of AL algorithms query the oracle O for the label of the point, but differ in 
the nature of the queries. In MQ based algorithms the active learner can query for the label of a point in the 
input space $\mathcal{X}$, but this query might not necessarily be from the support of the marginal distribution $\mathcal{D}_X$.
With human annotators MQ algorithms might work poorly, as was demonstrated by Lang and Baum (1992) in the
case of handwritten digit recognition, where the annotators were faced with the awkward situation of
labeling semantically meaningless images. Stream based AL algorithms (Cohn et al. 1994; Chu et al. 2011)
sample a point x from the marginal distribution $\mathcal{D}_X$, and decide on the fly whether to query O for the label
of x. Stream based AL algorithms tend to be computationally efficient, and most appropriate when the
underlying distribution changes with time. Pool based AL algorithms assume that one has access to a large 
pool $\mathcal{P} = \{x_1, \dots, x_n\}$ of unlabeled i.i.d. examples sampled from $\mathcal{D}_X$, and given budget constraints B, the
maximum number of points they are allowed to query, query the most informative set of points. Both pool 
based AL algorithms, and stream based AL algorithms overcome the problem of awkward queries, which 
MQ based algorithms face. However in our experiments we discovered that stream based AL algorithms 
tend to query more points than necessary, and have poorer learning rates when compared to pool based AL 
algorithms. 
1.1 Contributions.

In this paper we propose a pool based active learning algorithm called UPAL, which given a hypothesis
space $\mathcal{H}$, and a margin based loss function $\phi(\cdot)$ minimizes a provably unbiased estimator of the risk
$\mathbb{E}[\phi(y h(x))]$. While unbiased estimators of risk have been used in stream based AL algorithms, no
such estimators have been introduced for pool based AL algorithms. We do this by using the idea of
importance weights introduced for AL in Beygelzimer et al. (2009). Roughly speaking UPAL proceeds
in rounds and in each round puts a probability distribution over the entire pool, and samples a point
from the pool. It then queries for the label of the point. The probability distribution in each round
is determined by the current active learner obtained by minimizing the importance weighted risk over
$\mathcal{H}$. Specifically in this paper we shall be concerned with linear hypothesis spaces, i.e. $\mathcal{H} = \mathbb{R}^d$.



In Theorem 2 (Section 2.1) we show that for the squared loss UPAL is equivalent to an exponentially
weighted average (EWA) forecaster commonly used in the problem of learning with expert advice
(Cesa-Bianchi and Lugosi 2006). Precisely we show that if each hypothesis $h \in \mathcal{H}$ is considered to be an expert
and the importance weighted loss on the currently labeled part of the pool is used as an estimator of the
risk of $h \in \mathcal{H}$, then the hypothesis learned by UPAL is the same as an EWA forecaster. Hence UPAL
can be seen as pruning the hypothesis space, in a soft manner, by placing a probability distribution
that is determined by the importance weighted loss of each classifier on the currently labeled part of
the pool.

In Section 3 we prove consistency of UPAL with the squared loss, when the true underlying hypothesis
is a linear hypothesis. Our proof employs some elegant results from random matrix theory regarding
eigenvalues of sums of random matrices (Hsu et al. 2011a,b; Tropp 2010). While it should be possible
to improve the constants and exponent of dimensionality involved in $n_{0,\delta}, T_{0,\delta}, T_{1,\delta}$ used in Theorem 3,
our results qualitatively provide us the insight that the label complexity with the squared loss will
depend on the condition number, and the minimum eigenvalue of the covariance matrix $\Sigma$. This kind
of insight, to our knowledge, has not been provided before in the literature of active learning.

In Section 5 we provide a thorough empirical analysis of UPAL comparing it to the active learner
implementation in Vowpal Wabbit (VW) (Langford et al. 2011), and a batch mode active learning
algorithm, which we shall call BMAL (Hoi et al. 2006). These experiments demonstrate the positive
impact of importance weighting, and the better performance of UPAL over the VW implementation.
We also empirically demonstrate the scalability of UPAL over BMAL on the MNIST dataset. When
we are required to query a large number of points UPAL is up to 7 times faster than BMAL.



2 Algorithm Design 

A good active learning algorithm needs to take into account the fact that the points it has queried might
not reflect the true underlying marginal distribution. This problem is similar to the problem of dataset
shift (Quinonero et al. 2008) where the train and test distributions are potentially different, and the learner
needs to take into account this bias during the learning process. One approach to this problem is to use
importance weights, where during the training process instead of weighing all the points equally the algorithm
weighs the points differently. UPAL proceeds in rounds, where in each round t, we put a probability
distribution $\{p_i^t\}_{i=1}^{n}$ on the entire pool $\mathcal{P}$, and sample one point from this distribution. If the sampled point
was queried in one of the previous rounds $1, \dots, t-1$ then its queried label from the previous round is reused,
else the oracle O is queried for the label of the point. Denote by $Q_i^t \in \{0, 1\}$ a random variable that takes
the value 1 if the point $x_i$ was queried for its label in round t and 0 otherwise. In order to guarantee that
our estimate of the error rate of a hypothesis $h \in \mathcal{H}$ is unbiased we use importance weighting, where a point
$x_i \in \mathcal{P}$ in round t gets an importance weight of $\frac{Q_i^t}{p_i^t}$. Notice that by definition $\mathbb{E}\left[\frac{Q_i^t}{p_i^t} \,\middle|\, p_i^t\right] = 1$. We formally prove
that importance weighted risk is an unbiased estimator of the true risk. Let $\mathcal{D}_n$ denote a product distribution
on $(x_1, y_1), \dots, (x_n, y_n)$. Also denote by $Q_{1:n}^{1:t}$ the collection of random variables $Q_1^1, \dots, Q_n^1, \dots, Q_n^t$. Let
$\langle \cdot, \cdot \rangle$ denote the inner product. We have the following result.



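Theorem 1. Let $\hat{L}_t(h) \stackrel{\text{def}}{=} \frac{1}{nt} \sum_{i=1}^{n} \sum_{\tau=1}^{t} \frac{Q_i^\tau}{p_i^\tau} \phi(y_i \langle h, x_i \rangle)$, where $p_i^\tau > 0$ for all $\tau = 1, \dots, t$. Then

$$\mathbb{E}_{Q_{1:n}^{1:t}, \mathcal{D}_n} \hat{L}_t(h) = L(h). \tag{1}$$

Proof.

$$\mathbb{E}_{Q_{1:n}^{1:t}, \mathcal{D}_n} \hat{L}_t(h) = \frac{1}{nt} \sum_{i=1}^{n} \sum_{\tau=1}^{t} \mathbb{E}_{\mathcal{D}_n} \left[ \mathbb{E}\left[ \frac{Q_i^\tau}{p_i^\tau} \,\middle|\, Q_{1:n}^{1:\tau-1}, \mathcal{D}_n \right] \phi(y_i \langle h, x_i \rangle) \right] = \frac{1}{nt} \sum_{i=1}^{n} \sum_{\tau=1}^{t} \mathbb{E}_{\mathcal{D}_n} \phi(y_i \langle h, x_i \rangle) = L(h). \tag{2}$$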
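The unbiasedness argument can be checked exactly on a toy pool. The sketch below is illustrative only: the loss values and sampling probabilities are made-up numbers, and it verifies unbiasedness for a single round conditioned on the pool (the expectation over the data then gives $L(h)$ as in Theorem 1).

```python
from fractions import Fraction


def importance_weighted_risk(losses, probs, sampled):
    """One-round estimate (1/n) * sum_i (Q_i / p_i) * loss_i,
    where Q_i = 1 only for the sampled index."""
    n = len(losses)
    return losses[sampled] / probs[sampled] / n


def expected_estimate(losses, probs):
    """Exact expectation of the estimate over the sampled index j ~ probs."""
    return sum(p * importance_weighted_risk(losses, probs, j)
               for j, p in enumerate(probs))


# Hypothetical per-point losses phi(y_i <h, x_i>) and a non-uniform,
# strictly positive sampling distribution over a pool of four points.
losses = [Fraction(1, 4), Fraction(9, 4), Fraction(0), Fraction(1)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
pool_risk = sum(losses) / len(losses)
```

Any strictly positive `probs` gives the same expectation, which is why the algorithm keeps every sampling probability bounded away from zero.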

The theorem guarantees that as long as the probability of querying any point in the pool in any round is
non-zero, $\hat{L}_t(h)$ will be an unbiased estimator of $L(h)$. How does one come up with a probability distribution
on $\mathcal{P}$ in round t? To solve this problem we resort to probabilistic uncertainty sampling, where the point whose
label is most uncertain as per the current hypothesis, $h_{A,t-1}$, gets a higher probability mass. The current
hypothesis is simply the minimizer of the importance weighted risk in $\mathcal{H}$, i.e. $h_{A,t-1} = \arg\min_{h \in \mathcal{H}} \hat{L}_{t-1}(h)$.
For any point $x_i \in \mathcal{P}$, to calculate the uncertainty of the label $y_i$ of $x_i$, we first estimate $\eta(x_i) \stackrel{\text{def}}{=} \mathbb{P}[y_i = 1 \mid x_i]$
using $h_{A,t-1}$, and then use the entropy of the label distribution of $x_i$ to calculate the probability of querying
$x_i$. The estimate of $\eta(\cdot)$ in round t depends both on the current active learner $h_{A,t-1}$ and the loss function.
In general it is not possible to estimate $\eta(\cdot)$ with arbitrary convex loss functions. However it has been
shown by Zhang (2004) that the squared, logistic and exponential losses tend to estimate the underlying
conditional distribution $\eta(\cdot)$. Steps 4, 11 of Algorithm 1 depend on the loss function $\phi(\cdot)$ being used. If
we use the logistic loss, i.e. $\phi(yz) = \ln(1 + \exp(-yz))$, then $\eta_t(x) = \frac{1}{1 + \exp(-h_{A,t-1}^T x)}$. In case of squared loss
$\eta_t(x) = \min\{\max\{0, \frac{1 + h_{A,t-1}^T x}{2}\}, 1\}$. Since the loss function is convex, and the constraint set $\mathcal{H}$ is convex, the
minimization problem in step 11 of the algorithm is a convex optimization problem.
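The two estimates of the conditional label probability, and the entropy used to turn them into query probabilities, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the squared-loss form assumes the clipped value $(1 + h^T x)/2$, which is the estimate implied by the minimizer of $(1 - yz)^2$.

```python
import math


def eta_logistic(score):
    """Estimate of P[y = +1 | x] from a logistic-loss score h^T x."""
    return 1.0 / (1.0 + math.exp(-score))


def eta_squared(score):
    """Estimate of P[y = +1 | x] from a squared-loss score h^T x,
    clipped to [0, 1] (assumed form: (1 + score) / 2)."""
    return min(max(0.0, (1.0 + score) / 2.0), 1.0)


def label_entropy(eta):
    """Entropy (in nats) of a Bernoulli(eta) label; 0 at the endpoints."""
    if eta <= 0.0 or eta >= 1.0:
        return 0.0
    return eta * math.log(1.0 / eta) + (1.0 - eta) * math.log(1.0 / (1.0 - eta))
```

A score of 0 is maximally uncertain under both losses, so such points receive the largest share of the query probability.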

By design UPAL might requery points. An alternate strategy is to not allow requerying of points. 
However the importance weighted risk may not be an unbiased estimator of the true risk in such a case. 
Hence in order to retain the unbiasedness property we allow requerying in UPAL. 

2.1 The case of squared loss 

It is interesting to look at the behaviour of UPAL in the case of squared loss where $\phi(y h^T x) = (1 - y h^T x)^2$.
For the rest of the paper we shall denote by $h_A$ the hypothesis returned by UPAL at the end of T rounds.
We now show that the prediction of $h_A$ on any x is simply the exponentially weighted average of predictions
of all h in $\mathcal{H}$.



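Theorem 2. Let

$$z_i \stackrel{\text{def}}{=} \sum_{t=1}^{T} \frac{Q_i^t}{p_i^t}, \qquad \hat{\Sigma}_z \stackrel{\text{def}}{=} \sum_{i=1}^{n} z_i x_i x_i^T, \qquad v_z \stackrel{\text{def}}{=} \sum_{i=1}^{n} z_i y_i x_i, \qquad c \stackrel{\text{def}}{=} \sum_{i=1}^{n} z_i.$$

Define $w \in \mathbb{R}^d$ as

$$w = \frac{\int_{\mathbb{R}^d} \exp(-\hat{L}_T(h)) \, h \, dh}{\int_{\mathbb{R}^d} \exp(-\hat{L}_T(h)) \, dh}.$$

Assuming $\hat{\Sigma}_z$ is invertible we have for any $x_0 \in \mathbb{R}^d$, $w^T x_0 = h_A^T x_0$.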
Algorithm 1 UPAL (Input: $\mathcal{P} = \{x_1, \dots, x_n\}$, Loss function $\phi(\cdot)$, Budget B, Labeling Oracle O)
1. Set num_unique_queries = 0, $h_{A,0} = 0$, t = 1.
while num_unique_queries < B do
    2. Set $Q_i^t = 0$ for all $i = 1, \dots, n$.
    for $x_1, \dots, x_n \in \mathcal{P}$ do
        3. Set $p_{\min}^t = \frac{1}{n t^{1/4}}$.
        4. Calculate $\eta_t(x_i) = \mathbb{P}[y = +1 \mid x_i, h_{A,t-1}]$.
        5. Assign $p_i^t = p_{\min}^t + (1 - n p_{\min}^t) \frac{\eta_t(x_i) \ln(1/\eta_t(x_i)) + (1 - \eta_t(x_i)) \ln(1/(1 - \eta_t(x_i)))}{\sum_{j=1}^{n} \left[ \eta_t(x_j) \ln(1/\eta_t(x_j)) + (1 - \eta_t(x_j)) \ln(1/(1 - \eta_t(x_j))) \right]}$.
    end for
    6. Sample a point (say $x_j$) from $p^t(\cdot)$.
    if $x_j$ was queried previously then
        7. Reuse its previously queried label $y_j$.
    else
        8. Query oracle O for its label $y_j$.
        9. num_unique_queries ← num_unique_queries + 1.
    end if
    10. Set $Q_j^t = 1$.
    11. Solve the optimization problem: $h_{A,t} = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} \sum_{\tau=1}^{t} \frac{Q_i^\tau}{p_i^\tau} \phi(y_i h^T x_i)$.
    12. t ← t + 1.
end while
13. Return $h_A \stackrel{\text{def}}{=} h_{A,t}$.
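The following is a minimal, dependency-free sketch of Algorithm 1 for the squared loss, not the authors' implementation. Several details are assumptions made for illustration: the function and parameter names, the tiny `ridge` term that keeps the weighted least squares system solvable before enough distinct points have been queried, the fallback to uniform sampling when every entropy is zero, and the clipped squared-loss estimate $(1 + h^T x)/2$ of $\eta_t$.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][d] / M[i][i] for i in range(d)]


def entropy(eta):
    if eta <= 0.0 or eta >= 1.0:
        return 0.0
    return -eta * math.log(eta) - (1.0 - eta) * math.log(1.0 - eta)


def upal_squared_loss(pool, oracle, budget, rng, ridge=1e-8):
    """pool: list of feature vectors; oracle(j) returns the label of pool[j]."""
    n, d = len(pool), len(pool[0])
    h = [0.0] * d
    z = [0.0] * n               # z_i = sum over rounds of Q_i^t / p_i^t
    labels = {}                 # index -> queried label (unique queries)
    t = 1
    while len(labels) < budget:
        p_min = 1.0 / (n * t ** 0.25)                       # step 3
        ents = []
        for x in pool:                                      # step 4
            score = sum(hj * xj for hj, xj in zip(h, x))
            eta = min(max(0.0, (1.0 + score) / 2.0), 1.0)
            ents.append(entropy(eta))
        total = sum(ents)
        if total > 0.0:                                     # step 5
            probs = [p_min + (1.0 - n * p_min) * e / total for e in ents]
        else:                                               # assumption: uniform fallback
            probs = [1.0 / n] * n
        j = rng.choices(range(n), weights=probs)[0]         # step 6
        if j not in labels:                                 # steps 7-9
            labels[j] = oracle(j)
        z[j] += 1.0 / probs[j]                              # step 10
        # Step 11: weighted least squares, (sum z_i x_i x_i^T) h = sum z_i y_i x_i.
        A = [[ridge * (a == b) for b in range(d)] for a in range(d)]
        v = [0.0] * d
        for i, y in labels.items():
            x = pool[i]
            for a in range(d):
                v[a] += z[i] * y * x[a]
                for b in range(d):
                    A[a][b] += z[i] * x[a] * x[b]
        h = solve(A, v)
        t += 1                                              # step 12
    return h, labels
```

Since $y_i \in \{-1, +1\}$, minimizing $\sum_i z_i (1 - y_i h^T x_i)^2$ is the same as minimizing $\sum_i z_i (y_i - h^T x_i)^2$, which is why step 11 reduces to a linear solve in this sketch.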



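Proof. Throughout the proof we drop the positive normalizing factor $\frac{1}{nT}$ in $\hat{L}_T$; this rescaling changes neither the minimizer $h_A$ nor the mean of the Gaussian weighting computed below. By elementary linear algebra one can establish that

$$h_A = \hat{\Sigma}_z^{-1} v_z \tag{3}$$
$$\hat{L}_T(h) = h^T \hat{\Sigma}_z h - 2 h^T v_z + c \tag{4}$$
$$= (h - \hat{\Sigma}_z^{-1} v_z)^T \hat{\Sigma}_z (h - \hat{\Sigma}_z^{-1} v_z) + c - v_z^T \hat{\Sigma}_z^{-1} v_z. \tag{5}$$

Using standard integrals we get

$$Z = \int_{\mathbb{R}^d} \exp(-\hat{L}_T(h)) \, dh = \exp(-c + v_z^T \hat{\Sigma}_z^{-1} v_z) \sqrt{\pi^d} \sqrt{\det(\hat{\Sigma}_z^{-1})}.$$

In order to calculate $w^T x_0$, it is now enough to calculate the integral

$$I = \int_{\mathbb{R}^d} \exp(-\hat{L}_T(w)) \, w^T x_0 \, dw.$$

To solve this integral we proceed as follows. Define $I_1 = \int_{\mathbb{R}^d} \exp\left(-(h - \hat{\Sigma}_z^{-1} v_z)^T \hat{\Sigma}_z (h - \hat{\Sigma}_z^{-1} v_z)\right) h^T x_0 \, dh$. By simple algebra we
get

$$I = \int_{\mathbb{R}^d} \exp(-w^T \hat{\Sigma}_z w + 2 w^T v_z - c) \, w^T x_0 \, dw \tag{6}$$
$$= \exp(-c + v_z^T \hat{\Sigma}_z^{-1} v_z) \, I_1. \tag{7}$$

Let $a = h - \hat{\Sigma}_z^{-1} v_z$. We then get

$$I_1 = \int_{\mathbb{R}^d} h^T x_0 \exp\left(-(h - \hat{\Sigma}_z^{-1} v_z)^T \hat{\Sigma}_z (h - \hat{\Sigma}_z^{-1} v_z)\right) dh$$
$$= \int_{\mathbb{R}^d} (a^T x_0 + v_z^T \hat{\Sigma}_z^{-1} x_0) \exp(-a^T \hat{\Sigma}_z a) \, da$$
$$= \underbrace{\int_{\mathbb{R}^d} (a^T x_0) \exp(-a^T \hat{\Sigma}_z a) \, da}_{I_2} + \underbrace{\int_{\mathbb{R}^d} v_z^T \hat{\Sigma}_z^{-1} x_0 \exp(-a^T \hat{\Sigma}_z a) \, da}_{I_3}.$$

Clearly $I_2$, being the integral of an odd function over the entire space, evaluates to 0. To calculate $I_3$
we shall substitute $\hat{\Sigma}_z = S S^T$, where $S \succ 0$. Such a decomposition is possible since $\hat{\Sigma}_z \succ 0$. Now define
$z = S^T a$. We get

$$I_3 = v_z^T \hat{\Sigma}_z^{-1} x_0 \int_{\mathbb{R}^d} \exp(-z^T z) \det(S^{-1}) \, dz \tag{8}$$
$$= v_z^T \hat{\Sigma}_z^{-1} x_0 \det(S^{-1}) \sqrt{\pi^d}. \tag{9}$$

Using equations (7), (8), (9) we get

$$I = \sqrt{\pi^d} \, v_z^T \hat{\Sigma}_z^{-1} x_0 \det(S^{-1}) \exp(-c + v_z^T \hat{\Sigma}_z^{-1} v_z). \tag{10}$$

Hence we get

$$w^T x_0 = \frac{I}{Z} = \frac{\det(S^{-1})}{\sqrt{\det(\hat{\Sigma}_z^{-1})}} \, v_z^T \hat{\Sigma}_z^{-1} x_0 = v_z^T \hat{\Sigma}_z^{-1} x_0 = h_A^T x_0,$$

where the penultimate equality follows from the fact that $\det(\hat{\Sigma}_z^{-1}) = 1/\det(\hat{\Sigma}_z) = 1/\det(S S^T) = 1/(\det(S))^2$, and the last equality follows from equation (3). □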
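Theorem 2 can be checked numerically in one dimension, where the integrals reduce to sums over a fine grid. The sketch below is an illustration under stated assumptions: the data values, weights, grid range and function names are made up, and the loss is taken without the $\frac{1}{nT}$ normalization, as in the proof.

```python
import math


def ewa_mean_1d(xs, ys, zs, lo=-10.0, hi=10.0, steps=20001):
    """Riemann-sum approximation of
    w = int exp(-L(h)) h dh / int exp(-L(h)) dh,
    with L(h) = sum_i z_i (y_i - h x_i)^2 in one dimension."""
    def loss(h):
        return sum(z * (y - h * x) ** 2 for x, y, z in zip(xs, ys, zs))

    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    losses = [loss(h) for h in grid]
    shift = min(losses)          # subtract the minimum for numerical stability
    weights = [math.exp(-(l - shift)) for l in losses]
    return sum(w * h for w, h in zip(weights, grid)) / sum(weights)


def weighted_least_squares_1d(xs, ys, zs):
    """Closed form h_A = v_z / Sigma_z in one dimension."""
    sigma_z = sum(z * x * x for x, z in zip(xs, zs))
    v_z = sum(z * y * x for x, y, z in zip(xs, ys, zs))
    return v_z / sigma_z


xs, ys, zs = [1.0, -2.0, 0.5], [1.0, -1.0, 1.0], [2.0, 1.0, 4.0]
```

Here $\hat{\Sigma}_z = 7$ and $v_z = 6$, so both routes should give $6/7$; the exponential weights form a Gaussian centred at the weighted least squares solution.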

Theorem 2 is instructive. It tells us that assuming that the matrix $\hat{\Sigma}_z$ is invertible, $h_A$ is the same as an
exponentially weighted average of all the hypotheses in $\mathcal{H}$. Hence one can view UPAL as learning with expert
advice, in the stochastic setting, where each individual hypothesis $h \in \mathcal{H}$ is an expert, and the exponential of
$-\hat{L}_T$ is used to weigh the hypotheses in $\mathcal{H}$. Such forecasters have been commonly used in learning with expert
advice. This also allows us to interpret UPAL as pruning the hypothesis space in a soft way via exponential
weighting, where the hypothesis that has suffered more cumulative loss gets less weight.
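The soft pruning interpretation is easiest to see with a finite set of experts. The sketch below is a generic exponentially weighted average forecaster, not code from the paper; the function names and the example losses are hypothetical.

```python
import math


def ewa_weights(losses):
    """Normalized weights proportional to exp(-loss) for each expert."""
    m = min(losses)                          # shift for numerical stability
    raw = [math.exp(-(l - m)) for l in losses]
    s = sum(raw)
    return [r / s for r in raw]


def ewa_prediction(losses, predictions):
    """Weighted average of the experts' predictions."""
    w = ewa_weights(losses)
    return sum(wi * pi for wi, pi in zip(w, predictions))
```

No expert is ever discarded outright; an expert with larger cumulative loss simply contributes exponentially less to the prediction.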

3 Bounding the excess risk 

It is natural to ask if UPAL is consistent. That is, will UPAL do as well as the optimal hypothesis in $\mathcal{H}$ as
$n \to \infty$, $T \to \infty$? We answer this question in the affirmative. We shall analyze the excess risk of the hypothesis
returned by our active learner, denoted as $h_A$, after T rounds when the loss function is the squared loss.
The prime motivation for using squared loss over other loss functions is that squared losses yield closed
form estimators, which can then be elegantly analyzed using results from random matrix theory (Hsu et al.
2011a,b; Tropp 2010). It should be possible to extend these results to other loss functions such as the logistic
loss, or exponential loss using results from empirical process theory (van de Geer 2000).



3.1 Main result 

Theorem 3. Let $(x_1, y_1), \dots, (x_n, y_n)$ be sampled i.i.d. from a distribution. Suppose assumptions A0-A3 hold.
Let $\delta \in (0, 1)$, and suppose $n \ge n_{0,\delta}$, $T \ge \max\{T_{0,\delta}, T_{1,\delta}\}$. With probability at least $1 - 10\delta$ the excess risk of
the active learner returned by UPAL after T rounds is

$$L(h_A) - L(\beta) = O\left( \left( \frac{1}{n} + \frac{n}{\sqrt{T}} \right) \left( d + 2\sqrt{d \ln(1/\delta)} + 2 \ln(1/\delta) \right) \right).$$
3.2 Assumptions, and Notation.

A0 (Invertibility of $\Sigma$) The data covariance matrix $\Sigma$ is invertible.

A1 (Statistical leverage condition) There exists a finite $\gamma_0 \ge 1$ such that almost surely

$$\|\Sigma^{-1/2} x\| \le \gamma_0 \sqrt{d}.$$

A2 There exists a finite $\gamma_1 \ge 1$ such that $\mathbb{E}[\exp(\alpha^T x)] \le \exp\left( \frac{\|\alpha\|^2 \gamma_1^2}{2} \right)$.

A3 (Linear hypothesis) We shall assume that $y = \beta^T x + \xi(x)$, where $\xi(x) \in [-2, +2]$ is additive noise with
$\mathbb{E}[\xi(x) \mid x] = 0$.

Assumption A0 is necessary for the problem to be well defined. A1 has been used in recent literature to
analyze linear regression under random design and is a Bernstein like condition (Rokhlin and Tygert 2008).
A2 can be seen as a softer form of boundedness condition on the support of the distribution. In particular
if the data is bounded in a d-dimensional unit cube then it suffices to take $\gamma_1 = 1/2$. It may be possible to
satisfy A3 by mapping data to kernel spaces. Though popularly used kernels such as the Gaussian kernel map
the data to infinite dimensional spaces, a finite dimensional approximation of such kernel mappings can be
found by the use of random features (Rahimi and Recht 2007).
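As an illustration of the finite dimensional approximation mentioned above, the sketch below builds random Fourier features for the Gaussian kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$. It is a generic construction in the spirit of Rahimi and Recht (2007), not part of UPAL; the function name, the feature count `D` and the kernel parameter `gamma` are assumptions chosen for the example.

```python
import math
import random


def make_rff(d, D, gamma, rng):
    """Return a feature map phi: R^d -> R^D whose inner products approximate
    exp(-gamma * ||x - y||^2). Frequencies are drawn from N(0, 2*gamma*I)
    and phases uniformly from [0, 2*pi]."""
    W = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(d)]
         for _ in range(D)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)

    def phi(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    return phi
```

A linear hypothesis in the `D`-dimensional feature space then plays the role of $h \in \mathbb{R}^d$ in the analysis, with `D` in place of `d`.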
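Notation.

1. $h_A$ is the active learner output by our active learning algorithm at the end of T rounds.

2. $\forall i = 1, \dots, n: \; z_i \stackrel{\text{def}}{=} \sum_{t=1}^{T} \frac{Q_i^t}{p_i^t}$, $\qquad \hat{\Sigma}_z \stackrel{\text{def}}{=} \sum_{i=1}^{n} z_i x_i x_i^T$.

3. $\psi_z \stackrel{\text{def}}{=} \sum_{i=1}^{n} z_i \xi(x_i) x_i$, $\qquad \hat{\Sigma} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T$.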

n 

_?TDT.-„Tl df* ,. T , r r 



T 



i=l 

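$n_{0,\delta} \stackrel{\text{def}}{=} 7200 \, d^2 \gamma_0^4 \left( d \ln(5) + \ln(10/\delta) \right)$, $\qquad T_{1,\delta} \stackrel{\text{def}}{=} 12 + 512\sqrt{2} \, d^{8/3} \gamma_0^{16/3} \ln^{4/3}(d/\delta)$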



Amin(,^J 



where $\delta \in (0, 1)$.
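To get a feel for the size of these thresholds, the sketch below evaluates them numerically. It assumes the readings $n_{0,\delta} = 7200 d^2 \gamma_0^4 (d \ln 5 + \ln(10/\delta))$ and $T_{1,\delta} = 12 + 512\sqrt{2} d^{8/3} \gamma_0^{16/3} \ln^{4/3}(d/\delta)$; the function names are hypothetical, and the formula for $T_{0,\delta}$ is not reproduced here.

```python
import math


def n0(d, gamma0, delta):
    """Pool-size threshold n_{0,delta} (assumed form, see lead-in)."""
    return 7200 * d ** 2 * gamma0 ** 4 * (d * math.log(5) + math.log(10 / delta))


def t1(d, gamma0, delta):
    """Round threshold T_{1,delta} (assumed form, see lead-in).
    Requires d / delta >= 1 so that the logarithm is non-negative."""
    return 12 + 512 * math.sqrt(2) * d ** (8 / 3) * gamma0 ** (16 / 3) \
        * math.log(d / delta) ** (4 / 3)
```

Even for `d = 1` the pool-size threshold is in the tens of thousands, which is consistent with the remark that the constants and the exponents of the dimension can likely be improved.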



3.3 Overview of the proof 

The excess risk of a hypothesis $h \in \mathcal{H}$ is defined as $L(h) - L(\beta) = \mathbb{E}_{x,y \sim \mathcal{D}}[(y - h^T x)^2 - (y - \beta^T x)^2]$. Our aim
is to provide high probability bounds for the excess risk, where the probability measure is w.r.t. the sampled
points $(x_1, y_1), \dots, (x_n, y_n)$ and $Q_1^1, \dots, Q_n^T$. The proof proceeds as follows.

1. In Lemma 1, assuming that the matrices $\hat{\Sigma}_z, \hat{\Sigma}$ are invertible, we upper bound the excess risk as the
product $\|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|^2 \, \|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2 \, \|\hat{\Sigma}^{-1/2} \psi_z\|^2$. The prime motivation in doing so is that bounding
such "squared norm" terms can be reduced to bounding the maximum eigenvalue of random matrices,
which is a well studied problem in random matrix theory.

2. In Lemma 5 we provide an upper bound for $\|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2$. To do this we use the simple fact that the
matrix 2-norm of a positive semidefinite matrix is nothing but the maximum eigenvalue of the matrix.
With this observation, and by exploiting the structure of the matrix $\hat{\Sigma}$, the problem reduces to giving
probabilistic upper bounds for the maximum eigenvalue of a sum of random rank-1 matrices. Theorem 5
provides us with a tool to prove such bounds.

3. In Lemma 6 we bound $\|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|^2$. The proof is in the same spirit as in Lemma 5, however the
resulting probability problem is that of bounding the maximum eigenvalue of a sum of random matrices,
which are not necessarily rank-1. Theorem 6 provides us with Bernstein type bounds to analyze the
eigenvalues of sums of random matrices.

4. In Lemma 7 we bound the quantity $\|\hat{\Sigma}^{-1/2} \psi_z\|^2$. Notice that here we are bounding the squared norm
of a random vector. Theorem 4 provides us with a tool to analyze such quadratic forms under the
assumption that the random vector has sub-Gaussian exponential moments behaviour.

5. Finally all the above steps were conditioned on the invertibility of the random matrices $\hat{\Sigma}, \hat{\Sigma}_z$. We
provide conditions on n, T (this explains why we defined the quantities $n_{0,\delta}, T_{0,\delta}, T_{1,\delta}$) which guarantee
the invertibility of $\hat{\Sigma}, \hat{\Sigma}_z$. Such problems boil down to calculating lower bounds on the minimum
eigenvalue of the random matrices in question, and to establish such lower bounds we once again use
Theorems 5 and 6.
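Two elementary facts are used repeatedly in the steps above: the largest eigenvalue of a rank-1 matrix $v v^T$ is $\|v\|^2$, and the largest eigenvalue of a sum of positive semidefinite matrices is at most the sum of their largest eigenvalues. The sketch below checks both with a plain power iteration. It is illustrative only: the helper names are hypothetical, and the routine assumes a symmetric positive semidefinite input whose top eigenvector is not orthogonal to the all-ones start vector.

```python
import math


def outer(v):
    """Rank-1 matrix v v^T."""
    return [[a * b for b in v] for a in v]


def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def lambda_max(A, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    d = len(A)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    Av = [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, Av))     # Rayleigh quotient
```

For a positive semidefinite matrix this value is also its matrix 2-norm, which is the reduction used in items 2 and 3.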



3.4 Full Proof 

We shall now provide a way to bound the excess risk of our active learner hypothesis. Suppose $h_A$ was the
hypothesis represented by the active learner at the end of the T rounds. By the definition of our active
learner and the definition of $\beta$ we get

$$h_A = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{Q_i^t}{p_i^t} (y_i - h^T x_i)^2 = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} z_i (y_i - h^T x_i)^2 = \hat{\Sigma}_z^{-1} v_z \tag{11}$$
$$\beta = \arg\min_{\beta} \mathbb{E}(y - \beta^T x)^2 = \Sigma^{-1} \mathbb{E}[y x]. \tag{12}$$

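Lemma 1. Assume $\hat{\Sigma}_z, \hat{\Sigma}$ are both invertible, and assumption A0 applies. Then the excess risk of the
classifier after T rounds of our active learning algorithm is bounded as

$$L(h_A) - L(\beta) \le \|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|^2 \, \|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2 \, \|\hat{\Sigma}^{-1/2} \psi_z\|^2. \tag{13}$$

Proof.

$$L(h_A) - L(\beta) = \mathbb{E}[(y - h_A^T x)^2 - (y - \beta^T x)^2]$$
$$= \mathbb{E}_{x,y}[h_A^T x x^T h_A - 2 y h_A^T x - \beta^T x x^T \beta + 2 y \beta^T x]$$
$$= h_A^T \Sigma h_A - 2 h_A^T \mathbb{E}[x y] - \beta^T \Sigma \beta + 2 \beta^T \Sigma \beta \quad [\text{since } \Sigma \beta = \mathbb{E}[y x]]$$
$$= h_A^T \Sigma h_A - \beta^T \Sigma \beta - 2 h_A^T \Sigma \beta + 2 \beta^T \Sigma \beta$$
$$= h_A^T \Sigma h_A + \beta^T \Sigma \beta - 2 h_A^T \Sigma \beta$$
$$= \|\Sigma^{1/2} (h_A - \beta)\|^2. \tag{14}$$

We shall next bound the quantity $\|h_A - \beta\|$ which will be used to bound the excess risk in Equation (14).
To do this we shall use assumption A3 along with the definitions of $h_A, \beta$. We have the following chain of
equalities.

$$h_A = \hat{\Sigma}_z^{-1} v_z = \hat{\Sigma}_z^{-1} \sum_{i=1}^{n} z_i y_i x_i = \hat{\Sigma}_z^{-1} \sum_{i=1}^{n} z_i (\beta^T x_i + \xi(x_i)) x_i = \hat{\Sigma}_z^{-1} \Big( \sum_{i=1}^{n} z_i x_i x_i^T \Big) \beta + \hat{\Sigma}_z^{-1} \sum_{i=1}^{n} z_i \xi(x_i) x_i = \beta + \hat{\Sigma}_z^{-1} \psi_z. \tag{15}$$

Using Equations (14), (15) we get the following series of inequalities for the excess risk bound

$$L(h_A) - L(\beta) = \|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \psi_z\|^2$$
$$= \|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2} \, \Sigma^{-1/2} \hat{\Sigma}^{1/2} \, \hat{\Sigma}^{-1/2} \psi_z\|^2$$
$$\le \|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|^2 \, \|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2 \, \|\hat{\Sigma}^{-1/2} \psi_z\|^2. \tag{16}$$

□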
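The identity $L(h) - L(\beta) = \|\Sigma^{1/2}(h - \beta)\|^2$ in Equation (14) can be verified exactly on a small discrete distribution. In the sketch below everything concrete is an assumption made for illustration: `x` is uniform over three made-up points, the noise takes the values $\pm 1$ with equal probability (so it has zero conditional mean as in A3), and the helper names are hypothetical.

```python
from fractions import Fraction as F


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def risk(h, xs, beta, noise=(F(-1), F(1))):
    """Exact squared-loss risk E[(y - h^T x)^2] with y = beta^T x + noise,
    x uniform over xs and noise uniform over the given values."""
    total = F(0)
    for x in xs:
        for xi in noise:
            y = dot(beta, x) + xi
            total += (y - dot(h, x)) ** 2
    return total / (len(xs) * len(noise))


def sigma_norm_sq(h, beta, xs):
    """(h - beta)^T Sigma (h - beta) with Sigma = E[x x^T]."""
    diff = [a - b for a, b in zip(h, beta)]
    return sum(dot(diff, x) ** 2 for x in xs) / len(xs)


xs = [[F(1), F(0)], [F(1), F(2)], [F(-1), F(1)]]
beta = [F(1), F(-1)]
h = [F(1, 2), F(3)]
```

The cross term between the noise and $(\beta - h)^T x$ vanishes because the noise has zero conditional mean, which is exactly the step that uses $\Sigma \beta = \mathbb{E}[y x]$ in the proof.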



The decomposition in Lemma 1 assumes that both $\hat{\Sigma}_z, \hat{\Sigma}$ are invertible. Before we can establish conditions
for the matrices $\hat{\Sigma}_z, \hat{\Sigma}$ to be invertible we need the following elementary result.

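Proposition 1. For any arbitrary $\alpha \in \mathbb{R}^d$, under assumption A1 we have

$$\mathbb{E}[\exp(\alpha^T \Sigma^{-1/2} x)] \le 5 \exp\left( \frac{3 d \gamma_0^2 \|\alpha\|^2}{2} \right).$$

Proof. From Cauchy-Schwarz inequality and A1 we get

$$-\|\alpha\| \gamma_0 \sqrt{d} \le -\|\alpha\| \, \|\Sigma^{-1/2} x\| \le \alpha^T \Sigma^{-1/2} x \le \|\alpha\| \, \|\Sigma^{-1/2} x\| \le \|\alpha\| \gamma_0 \sqrt{d}.$$

Also $\mathbb{E}[\alpha^T \Sigma^{-1/2} x] \le \|\alpha\| \gamma_0 \sqrt{d}$. Using Hoeffding's lemma we get

$$\mathbb{E}[\exp(\alpha^T \Sigma^{-1/2} x)] \le \exp\left( \|\alpha\| \gamma_0 \sqrt{d} + \frac{\|\alpha\|^2 d \gamma_0^2}{2} \right) \tag{17}$$
$$= \exp\left( \|\alpha\| \gamma_0 \sqrt{d} - \|\alpha\|^2 d \gamma_0^2 \right) \exp\left( \frac{3 \|\alpha\|^2 d \gamma_0^2}{2} \right) \tag{18}$$
$$\le 5 \exp\left( \frac{3 \|\alpha\|^2 d \gamma_0^2}{2} \right), \tag{19}$$

where the last step uses $u - u^2 \le 1/4$ for every real u, and $e^{1/4} < 5$. □

The following lemma will be useful in bounding the terms $\|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|$, $\|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2$.

Lemma 2. Let $J \stackrel{\text{def}}{=} \sum_{i=1}^{n} \Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2}$. Let $n \ge n_{0,\delta}$. Then the following inequalities hold separately
with probability at least $1 - \delta$ each

$$\lambda_{\max}(J) \le n + 6 d n \gamma_0^2 \left[ \sqrt{\frac{32 (d \ln(5) + \ln(10/\delta))}{n}} + \frac{2 (d \ln(5) + \ln(10/\delta))}{n} \right] \le 3n/2 \tag{20}$$
$$\lambda_{\min}(J) \ge n - 6 d n \gamma_0^2 \left[ \sqrt{\frac{32 (d \ln(5) + \ln(10/\delta))}{n}} + \frac{2 (d \ln(5) + \ln(10/\delta))}{n} \right] \ge n/2. \tag{21}$$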
(21) 
Proof. Notice that $\mathbb{E}[\Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2}] = I$. From Proposition 1 we have $\mathbb{E}[\exp(\alpha^T \Sigma^{-1/2} x)] \le 5 \exp(3 \|\alpha\|^2 d \gamma_0^2 / 2)$.
By using Theorem 5 we get with probability at least $1 - \delta$:

$$\lambda_{\max}(J) \le n + 6 d n \gamma_0^2 \left[ \sqrt{\frac{32 (d \ln(5) + \ln(2/\delta))}{n}} + \frac{2 (d \ln(5) + \ln(2/\delta))}{n} \right]. \tag{22}$$

Put $n \ge n_{0,\delta}$ to get the desired result. The lower bound on $\lambda_{\min}$ is also obtained in the same way. □
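Lemma 3. Let $n \ge n_{0,\delta}$. With probability at least $1 - \delta$ separately we have $\hat{\Sigma} \succ 0$, $\lambda_{\min}(\hat{\Sigma}) \ge \frac{1}{2} \lambda_{\min}(\Sigma)$, $\lambda_{\max}(\hat{\Sigma}) \le \frac{3}{2} \lambda_{\max}(\Sigma)$.

Proof. Using Lemma 2 we get for $n \ge n_{0,\delta}$ with probability at least $1 - \delta$, $\lambda_{\min}(J) \ge n/2$ and with probability
at least $1 - \delta$, $\lambda_{\max}(J) \le 3n/2$. Finally since $\Sigma^{1/2} J \Sigma^{1/2} = n \hat{\Sigma}$, and $J \succ 0$, $\Sigma \succ 0$, we get $\hat{\Sigma} \succ 0$. Further we
have the following upper bound with probability at least $1 - \delta$:

$$\lambda_{\max}(\hat{\Sigma}) = \tfrac{1}{n} \|\Sigma^{1/2} J \Sigma^{1/2}\| \tag{23}$$
$$\le \tfrac{1}{n} \|\Sigma^{1/2}\|^2 \|J\| \tag{24}$$
$$\le \tfrac{1}{n} \|\Sigma\| \, \|J\| \tag{25}$$
$$= \tfrac{1}{n} \lambda_{\max}(\Sigma) \lambda_{\max}(J) \tag{26}$$
$$\le \tfrac{3}{2} \lambda_{\max}(\Sigma), \tag{27}$$

where in the last step we used the upper bound on $\lambda_{\max}(J)$ provided by Lemma 2. Similarly we have the
following lower bound with probability at least $1 - \delta$

$$\lambda_{\min}(\hat{\Sigma}) = \frac{1}{n \, \lambda_{\max}(\Sigma^{-1/2} J^{-1} \Sigma^{-1/2})} \tag{28}$$
$$= \frac{1}{n \, \|\Sigma^{-1/2} J^{-1} \Sigma^{-1/2}\|} \tag{29}$$
$$\ge \frac{1}{n \, \|\Sigma^{-1/2}\| \, \|J^{-1}\| \, \|\Sigma^{-1/2}\|} \tag{30}$$
$$= \tfrac{1}{n} \lambda_{\min}(\Sigma) \lambda_{\min}(J) \tag{31}$$
$$\ge \tfrac{1}{2} \lambda_{\min}(\Sigma), \tag{32}$$

where in the last step we used the lower bound on $\lambda_{\min}(J)$ provided by Lemma 2. □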
The following proposition will be useful in proving Lemma 4.

Proposition 2. Let $\delta \in (0, 1)$. Under assumption A2, with probability at least $1 - \delta$, $\sum_{i=1}^{n} \|x_i\|^4 \le 25 n \gamma_1^4 d^2 \ln^2(n/\delta)$.

Proof. From A2 we have $\mathbb{E}[\exp(\alpha^T x)] \le \exp\left( \frac{\|\alpha\|^2 \gamma_1^2}{2} \right)$. Now applying Theorem 4 with $A = I_d$ we get

$$\mathbb{P}\left[ \|x\|^2 \le d \gamma_1^2 + 2 \gamma_1^2 \sqrt{d \ln(1/\delta)} + 2 \gamma_1^2 \ln(1/\delta) \right] \ge 1 - \delta. \tag{33}$$
The result now follows by the union bound. □

Lemma 4. Let $\delta \in (0, 1)$. For $T \ge T_{0,\delta}$, with probability at least $1 - 4\delta$ we have $\lambda_{\min}(\hat{\Sigma}_z) \ge \frac{nT \lambda_{\min}(\Sigma)}{4} > 0$.
Hence $\hat{\Sigma}_z$ is invertible.
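Proof. The proof uses Theorem 6. Let $M'_t \stackrel{\text{def}}{=} \sum_{i=1}^{n} \frac{Q_i^t}{p_i^t} x_i x_i^T$, so that $\hat{\Sigma}_z = \sum_{t=1}^{T} M'_t$. Now $\mathbb{E}_t M'_t = n \hat{\Sigma}$. Define
$R'_t \stackrel{\text{def}}{=} n \hat{\Sigma} - M'_t$, so that $\mathbb{E}_t R'_t = 0$. We shall apply Theorem 6 to the random matrices $R'_t$. In order to do
so we need upper bounds on $\lambda_{\max}(R'_t)$ and $\lambda_{\max}\left( \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_t R_t'^2 \right)$. Let $n \ge n_{0,\delta}$. Using Lemma 3 we get with
probability at least $1 - \delta$

$$\lambda_{\max}(R'_t) = \lambda_{\max}(n \hat{\Sigma} - M'_t) \le \lambda_{\max}(n \hat{\Sigma}) \le \frac{3 n \lambda_{\max}(\Sigma)}{2} \stackrel{\text{def}}{=} b_2. \tag{34}$$

Further

$$\lambda_{\max}\Big( \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_t R_t'^2 \Big) = \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t (n \hat{\Sigma} - M'_t)^2 \Big) \tag{35}$$
$$= \frac{1}{T} \lambda_{\max}\Big( -n^2 T \hat{\Sigma}^2 + \sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \frac{Q_i^t}{(p_i^t)^2} (x_i x_i^T)^2 \Big) \tag{36}$$
$$= \frac{1}{T} \lambda_{\max}\Big( -n^2 T \hat{\Sigma}^2 + \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{p_i^t} (x_i x_i^T)^2 \Big) \tag{37}$$
$$\le \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{p_i^t} (x_i x_i^T)^2 \Big) - n^2 \lambda_{\min}^2(\hat{\Sigma}) \tag{38}$$
$$\le n T^{1/4} \lambda_{\max}\Big( \sum_{i=1}^{n} (x_i x_i^T)^2 \Big) \tag{39}$$
$$\le n T^{1/4} \sum_{i=1}^{n} \lambda_{\max}^2(x_i x_i^T) \tag{40}$$
$$= n T^{1/4} \sum_{i=1}^{n} \|x_i\|^4 \tag{41}$$
$$\le 25 \gamma_1^4 d^2 n^2 T^{1/4} \ln^2(n/\delta) \stackrel{\text{def}}{=} \sigma_2^2. \tag{42}$$

Equation (36) follows from Equation (35) by the definition of $M'_t$ and the fact that at any given t only one point
is queried, i.e. $Q_i^t Q_j^t = 0$ for $i \ne j$. Equation (37) follows from Equation (36) since $\mathbb{E}_t Q_i^t = p_i^t$. Equation (38)
follows from Equation (37) by Weyl's inequality. Equation (39) follows from Equation (38) by substituting $p_{\min}^t$
in place of $p_i^t$. Equation (40) follows from Equation (39) by the use of Weyl's inequality. Equation (41) follows
from Equation (40) by using the fact that if p is a vector then $\lambda_{\max}(p p^T) = \|p\|^2$. Equation (42) follows from
Equation (41) by the use of Proposition 2. Notice that this step is a stochastic inequality and holds with
probability at least $1 - \delta$.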

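Finally applying Theorem 6 we have

$$\mathbb{P}\left[ \lambda_{\max}\Big( \frac{1}{T} \sum_{t=1}^{T} R'_t \Big) \le \sqrt{\frac{2 \sigma_2^2 \ln(d/\delta)}{T}} + \frac{b_2 \ln(d/\delta)}{3T} \right] \ge 1 - \delta \tag{43}$$
$$\mathbb{P}\left[ \lambda_{\max}\Big( n \hat{\Sigma} - \frac{1}{T} \sum_{t=1}^{T} M'_t \Big) \le \sqrt{\frac{2 \sigma_2^2 \ln(d/\delta)}{T}} + \frac{b_2 \ln(d/\delta)}{3T} \right] \ge 1 - \delta \tag{44}$$
$$\mathbb{P}\left[ \lambda_{\min}(n \hat{\Sigma}) - \lambda_{\min}\Big( \frac{1}{T} \sum_{t=1}^{T} M'_t \Big) \le \sqrt{\frac{2 \sigma_2^2 \ln(d/\delta)}{T}} + \frac{b_2 \ln(d/\delta)}{3T} \right] \ge 1 - \delta. \tag{45}$$

Substituting for $\sigma_2^2, b_2$, rearranging the inequalities, and using Lemma 3 to lower bound $\lambda_{\min}(\hat{\Sigma})$ we get

$$\mathbb{P}\left[ \lambda_{\min}\Big( \sum_{t=1}^{T} M'_t \Big) \ge T \lambda_{\min}(n \hat{\Sigma}) - \sqrt{2 T \sigma_2^2 \ln(d/\delta)} - \frac{b_2 \ln(d/\delta)}{3} \right] \ge 1 - \delta \tag{46}$$
$$\mathbb{P}\left[ \lambda_{\min}\Big( \sum_{t=1}^{T} M'_t \Big) \ge \frac{n T \lambda_{\min}(\Sigma)}{2} - \sqrt{2 T \sigma_2^2 \ln(d/\delta)} - \frac{b_2 \ln(d/\delta)}{3} \right] \ge 1 - 2\delta \tag{47}$$
$$\mathbb{P}\left[ \lambda_{\min}\Big( \sum_{t=1}^{T} M'_t \Big) \ge \frac{n T \lambda_{\min}(\Sigma)}{2} - 5\sqrt{2} \, \gamma_1^2 d n T^{5/8} \ln(n/\delta) \sqrt{\ln(d/\delta)} - \frac{n \ln(d/\delta) \lambda_{\max}(\Sigma)}{2} \right] \ge 1 - 4\delta. \tag{48}$$

For $T \ge T_{0,\delta}$, with probability at least $1 - 4\delta$, $\lambda_{\min}\left( \sum_{t=1}^{T} M'_t \right) = \lambda_{\min}(\hat{\Sigma}_z) \ge \frac{n T \lambda_{\min}(\Sigma)}{4}$. □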
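Lemma 5. For $n \ge n_{0,\delta}$, with probability at least $1 - \delta$ over the random sample $x_1, \dots, x_n$,

$$\|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2 \le 3/2.$$

Proof.

$$\|\Sigma^{-1/2} \hat{\Sigma}^{1/2}\|^2 = \|\hat{\Sigma}^{1/2} \Sigma^{-1/2}\|^2 \tag{49}$$
$$= \lambda_{\max}(\Sigma^{-1/2} \hat{\Sigma} \Sigma^{-1/2}) \tag{50}$$
$$= \lambda_{\max}\Big( \frac{1}{n} \sum_{i=1}^{n} \Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2} \Big) \le 3/2, \tag{51}$$

where in the first equality we used the fact that $\|A\| = \|A^T\|$ for a square matrix A, and $\|A\|^2 = \lambda_{\max}(A^T A)$,
and in the last step we used Lemma 2. □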

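Lemma 6. Suppose $\hat{\Sigma}_z$ is invertible. Given $\delta \in (0, 1)$, for $n \ge n_{0,\delta}$, and $T \ge \max\{T_{0,\delta}, T_{1,\delta}\}$, with probability
at least $1 - 3\delta$ over the samples

$$\|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\|^2 \le \frac{400}{n^2 T^2}.$$

Proof. The proof of this lemma is very similar to the proof of Lemma 4. From Lemma 4, for $n \ge n_{0,\delta}$, $T \ge T_{0,\delta}$,
with probability at least $1 - 4\delta$, $\hat{\Sigma}_z \succ 0$. Using the assumption that $\Sigma \succ 0$, we get $\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2} \succ 0$. Hence

$$\|\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}\| = \lambda_{\max}(\Sigma^{1/2} \hat{\Sigma}_z^{-1} \Sigma^{1/2}) = \frac{1}{\lambda_{\min}(\Sigma^{-1/2} \hat{\Sigma}_z \Sigma^{-1/2})}.$$

Hence it is enough to provide a lower bound on the smallest eigenvalue of the symmetric positive definite matrix $\Sigma^{-1/2} \hat{\Sigma}_z \Sigma^{-1/2}$. We have

$$\lambda_{\min}(\Sigma^{-1/2} \hat{\Sigma}_z \Sigma^{-1/2}) = \lambda_{\min}\Big( \sum_{i=1}^{n} z_i \Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2} \Big) = \lambda_{\min}\Big( \sum_{t=1}^{T} \underbrace{\sum_{i=1}^{n} \frac{Q_i^t}{p_i^t} \Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2}}_{\stackrel{\text{def}}{=} M_t} \Big).$$

Define $R_t \stackrel{\text{def}}{=} J - M_t$. Clearly $\mathbb{E}_t[M_t] = J$, and hence $\mathbb{E}_t[R_t] = 0$. From Weyl's inequality we have $\lambda_{\min}(J) + \lambda_{\max}\left( -\frac{1}{T} \sum_{t=1}^{T} M_t \right) \le \lambda_{\max}\left( \frac{1}{T} \sum_{t=1}^{T} R_t \right)$. Now applying Theorem 6 on $R_t$ we get with probability at least
$1 - \delta$

$$\lambda_{\min}(J) + \lambda_{\max}\Big( -\frac{1}{T} \sum_{t=1}^{T} M_t \Big) \le \sqrt{\frac{2 \sigma_1^2 \ln(d/\delta)}{T}} + \frac{b_1 \ln(d/\delta)}{3T}, \tag{52}$$

where

$$\lambda_{\max}(R_t) \le b_1 \tag{53}$$
$$\lambda_{\max}\Big( \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_t R_t^2 \Big) \le \sigma_1^2. \tag{54}$$

Rearranging Equation (52) and using the fact that $\lambda_{\max}(-A) = -\lambda_{\min}(A)$ we get with probability at least
$1 - \delta$,

$$\lambda_{\min}\Big( \sum_{t=1}^{T} M_t \Big) \ge T \lambda_{\min}(J) - \sqrt{2 T \sigma_1^2 \ln(d/\delta)} - \frac{b_1 \ln(d/\delta)}{3}. \tag{55}$$

Using Weyl's inequality (Horn and Johnson 1990) we have $\lambda_{\max}(J - M_t) \le \lambda_{\max}(J) \le \frac{3n}{2}$ with
probability at least $1 - \delta$, where in the last step we used Lemma 2. Let $b_1 = \frac{3n}{2}$. To calculate $\sigma_1^2$ we proceed
as follows.

$$\frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t (J - M_t)^2 \Big) = \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t (M_t^2) - T J^2 \Big) \tag{56}$$
$$\le \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t M_t^2 \Big) \tag{57}$$
$$= \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t \Big( \sum_{i=1}^{n} \frac{Q_i^t}{p_i^t} \Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2} \Big)^2 \Big) \tag{58}$$
$$= \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \mathbb{E}_t \sum_{i=1}^{n} \frac{Q_i^t}{(p_i^t)^2} (\Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2})^2 \Big) \tag{59}$$
$$= \frac{1}{T} \lambda_{\max}\Big( \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{p_i^t} (\Sigma^{-1/2} x_i x_i^T \Sigma^{-1/2})^2 \Big) \tag{60}$$
$$\le \frac{1}{T} \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{1}{p_i^t} \|\Sigma^{-1/2} x_i\|^4 \tag{61}$$
$$\le \frac{d^2 \gamma_0^4}{T} \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{1}{p_i^t} \tag{62}$$
$$\le \frac{n d^2 \gamma_0^4}{T} \sum_{t=1}^{T} \frac{1}{p_{\min}^t} \tag{63}$$
$$\le n^2 d^2 \gamma_0^4 T^{1/4} \stackrel{\text{def}}{=} \sigma_1^2. \tag{64}$$

Equation (57) follows from Equation (56) by using Weyl's inequality and the fact that $J \succeq 0$. Equation (59)
follows from Equation (58) since only one point is queried in every round and hence for any given t, $i \ne j$
we have $Q_i^t Q_j^t = 0$, and hence all the cross terms disappear when we expand the square. Equation (60)
follows from Equation (59) by using the fact that $\mathbb{E}_t Q_i^t = p_i^t$. Equation (61) follows from Equation (60) by
Weyl's inequality and the fact that the maximum eigenvalue of a rank-1 matrix of the form $v v^T$ is $\|v\|^2$.
Equation (62) follows from Equation (61) by using assumption A1. Equation (64) follows from Equation (63)
by our choice of $p_{\min}^t = \frac{1}{n t^{1/4}}$. Substituting the values of $\sigma_1^2, b_1$ in (55), using Lemma 2 to lower bound $\lambda_{\min}(J)$,
and applying the union bound to sum up all the failure probabilities we get for $n \ge n_{0,\delta}$, $T \ge \max\{T_{0,\delta}, T_{1,\delta}\}$
with probability at least $1 - 3\delta$,

$$\lambda_{\min}\Big( \sum_{t=1}^{T} M_t \Big) \ge T \lambda_{\min}(J) - \sqrt{2 T \cdot T^{1/4} n^2 d^2 \gamma_0^4 \ln(d/\delta)} - 3n/2$$
$$\ge \frac{nT}{2} - \sqrt{2} \, T^{5/8} n d \gamma_0^2 \sqrt{\ln(d/\delta)} - 3n/2 \ge nT/4. \quad □$$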

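The only missing piece in the proof is an upper bound for the quantity $\|\hat{\Sigma}^{-1/2} \psi_z\|^2$. The next lemma
provides us with an upper bound for this quantity.

Lemma 7. Suppose $\hat{\Sigma}$ is invertible. Let $\delta \in (0, 1)$. With probability at least $1 - \delta$ we have

$$\|\hat{\Sigma}^{-1/2} \psi_z\|^2 \le (2 n T^2 + 56 n^3 T \sqrt{T}) \left( d + 2\sqrt{d \ln(1/\delta)} + 2 \ln(1/\delta) \right).$$

Proof. Define the matrix $A \in \mathbb{R}^{d \times n}$ as follows. Let the i-th column of A be the vector $\frac{1}{\sqrt{n}} \hat{\Sigma}^{-1/2} x_i$, so that $A A^T = \frac{1}{n} \sum_{i=1}^{n} \hat{\Sigma}^{-1/2} x_i x_i^T \hat{\Sigma}^{-1/2} = I_d$. Now $\|\hat{\Sigma}^{-1/2} \psi_z\|^2 = n \|A p\|^2$, where $p = (p_1, \dots, p_n) \in \mathbb{R}^n$ and $p_i = \xi(x_i) z_i$ for
$i = 1, \dots, n$. Using the result for quadratic forms of sub-Gaussian random vectors (Theorem 4) we get

$$\|A p\|^2 \le \sigma^2 \left( \mathrm{tr}(I_d) + 2\sqrt{\mathrm{tr}(I_d) \ln(1/\delta)} + 2 \|I_d\| \ln(1/\delta) \right) = \sigma^2 \left( d + 2\sqrt{d \ln(1/\delta)} + 2 \ln(1/\delta) \right), \tag{65}$$

where $\sigma^2$ is such that for any arbitrary vector $\alpha$, $\mathbb{E}[\exp(\alpha^T p)] \le \exp(\|\alpha\|^2 \sigma^2)$.
Hence all that is left to be done is to prove that $\alpha^T p$ has sub-Gaussian exponential moments. Let $\xi = [\xi(x_1), \dots, \xi(x_n)]^T$ and

$$D_t \stackrel{\text{def}}{=} \sum_{i=1}^{n} \frac{\alpha_i \xi(x_i) Q_i^t}{p_i^t} - \alpha^T \xi \quad \text{for } t = 1, \dots, T. \tag{66}$$

With this definition we have the following series of equalities

$$\mathbb{E}[\exp(\alpha^T p)] = \mathbb{E}\Big[ \exp\Big( \sum_{t=1}^{T} D_t + T \alpha^T \xi \Big) \Big] = \mathbb{E}\Big[ \exp(T \alpha^T \xi) \, \mathbb{E}\Big[ \exp\Big( \sum_{t=1}^{T} D_t \Big) \,\Big|\, \mathcal{D}_n \Big] \Big]. \tag{67}$$

Conditioned on the data, the sequence $D_1, \dots, D_T$ forms a martingale difference sequence. Notice that

$$-\frac{2 \|\alpha\|}{p_{\min}^t} - \alpha^T \xi \le D_t \le -\alpha^T \xi + \frac{2 \|\alpha\|}{p_{\min}^t}. \tag{68}$$

We shall now bound the probability of large deviations of $D_t$ given history up until time t. This allows us to
put a bound on the large deviations of the martingale sum $\sum_{t=1}^{T} D_t$. Let $a > 0$. Using Markov's inequality
we get

$$\mathbb{P}[D_t \ge a \mid Q_{1:n}^{1:t-1}, \mathcal{D}_n] \le \min_{\gamma > 0} \exp(-\gamma a) \, \mathbb{E}[\exp(\gamma D_t) \mid Q_{1:n}^{1:t-1}, \mathcal{D}_n] \tag{69}$$
$$\le \min_{\gamma > 0} \exp\left( \frac{2 \gamma^2 \|\alpha\|^2}{(p_{\min}^t)^2} - \gamma a \right) \tag{70}$$
$$= \exp\left( -\frac{a^2}{8 \|\alpha\|^2 n^2 \sqrt{t}} \right). \tag{71}$$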
In the second step we used Hoeffding's lemma along with the boundedness property of $D_t$ shown in equation (68).
The same upper bound can be shown for the quantity $\mathbb{P}[D_t \le -a \mid Q_{1:n}^{1:t-1}, \mathcal{D}_n]$. Applying lemma [?] we
get with probability at least $1 - \delta$, conditioned on the data, we have

If; MWlna ffl ^ ^ D ,^ 112 | |a||wa/21n(lffl . (72) 

t=l V i=l 

Hence Y^t=i-Dt, conditioned on data, has sub-Gaussian tails as shown above. This leads to the following 
conditional exponential moments bound 

T 

E[exp(^ D t )\V n ] = exp (56||a|| VrVTln(l/5)) . (73) 

i=l 

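Finally putting together equations (67), (73) we get

$$\mathbb{E}[\exp(\alpha^T p)] \le \mathbb{E}[\exp(T \alpha^T \xi)] \exp(56 \|\alpha\|^2 n^2 T \sqrt{T}) \le \exp\left( (2 T^2 + 56 n^2 T \sqrt{T}) \|\alpha\|^2 \right). \tag{74}$$

In the last step we exploited the fact that $-2 \le \xi(x_i) \le 2$, and hence by Hoeffding's lemma $\mathbb{E}[\exp(\alpha^T \xi)] \le \exp(2 \|\alpha\|^2)$. This leads us to the choice of $\sigma^2 = 2 T^2 + 56 n^2 T \sqrt{T}$. Substituting this value of $\sigma^2$ in equation (65)
we get

$$\|A p\|^2 \le (2 T^2 + 56 n^2 T \sqrt{T}) \left( d + 2\sqrt{d \ln(1/\delta)} + 2 \ln(1/\delta) \right), \tag{75}$$

and hence with probability at least $1 - \delta$,

$$\|\hat{\Sigma}^{-1/2} \psi_z\|^2 = n \|A p\|^2 \le (2 n T^2 + 56 n^3 T \sqrt{T}) \left( d + 2\sqrt{d \ln(1/\delta)} + 2 \ln(1/\delta) \right). \tag{76}$$
We are now ready to prove our main result.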
We are now ready to prove our main result. 

Proof of theorem^ For n > tiqj and T > maxjTo^, Tu} from lemma [ij [4J both S z , and S are 
invertible with probability atleast 1 — S, 1 — 46 respectively. Conditioned on the invertibility of S z , E we 
get from lemmas |i|7| 1 1 S~ 1 S! 1 / 2 1 1 2 < 3/2 and 1 1 S x / 2 S~ 1 S x / 2 1 1 2 < 400/n 2 T 2 , and Wt- 1 / 2 ^^ 2 < (2nT 2 + 
56n 3 T 3 / 2 )(d+2y<d\n(l/5) + 21n(l/(5)) with probability atleast 1-5, 1-35, 1-5 respectively. Using lemma[l] 
and the union bound to add up all the failure probabilities we get the desired result. □ 



4 Related Work 



A variety of pool based AL algorithms have been proposed in the literature employing various query strategies.
However, none of them use unbiased estimates of the risk. One of the simplest strategies for AL is
uncertainty sampling, where the active learner queries the point whose label it is most uncertain about. This
strategy has been popular in text classification (Lewis and Gale, 1994), and information extraction (Settles
and Craven, 2008). Usually the uncertainty in the label is calculated using certain information-theoretic
criteria such as entropy, or variance of the label distribution. While uncertainty sampling has mostly been used
in a probabilistic setting, AL algorithms which learn non-probabilistic classifiers using uncertainty sampling
have also been proposed. Tong et al. (2001) proposed an algorithm in this framework where they query
the point closest to the current SVM hyperplane. Seung et al. (1992) introduced the query-by-committee
(QBC) framework where a committee of potential models, which all agree on the currently labeled data, is
maintained, and the point where most committee members disagree is considered for querying. In order to
design a committee in the QBC framework, algorithms such as query-by-boosting and query-by-bagging in
the discriminative setting (Abe and Mamitsuka, 1998), and sampling from a Dirichlet distribution over model
parameters in the generative setting (McCallum and Nigam, 1998) have been proposed. Other frameworks
include querying the point which causes the maximum expected reduction in error (Zhu et al., 2003; Guo
and Greiner, 2007), variance reducing query strategies such as the ones based on optimal design (Flaherty
et al., 2005; Zhang and Oles, 2000). A literature survey of AL algorithms has been done by Settles (2009).
AL algorithms that are consistent and have provable label complexity have been proposed for the agnostic
setting for the 0-1 loss in recent years (Dasgupta et al., 2007; Beygelzimer et al., 2009). The IWAL
framework introduced in Beygelzimer et al. (2009) was the first AL algorithm with guarantees for general
loss functions. However, the authors were unable to provide non-trivial label complexity guarantees for the
hinge loss and the squared loss.
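
The entropy criterion for uncertainty sampling mentioned above can be illustrated with a minimal sketch. It assumes a binary problem and a model that outputs P(y = +1 | x) for each pool point; the function names and the probabilities are hypothetical and are not taken from any of the cited papers.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) label distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def most_uncertain(pool_probs):
    """Index of the pool point whose predicted label has maximum entropy."""
    return max(range(len(pool_probs)), key=lambda i: binary_entropy(pool_probs[i]))

# Hypothetical predicted probabilities P(y = +1 | x) for five unlabeled points.
pool_probs = [0.95, 0.10, 0.55, 0.80, 0.30]
query_index = most_uncertain(pool_probs)
```

The point with predicted probability closest to 1/2 has the largest entropy, so it is the one selected for querying.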

UPAL, at least for squared losses, can be seen as using a QBC based querying strategy where the committee
is the entire hypothesis space, and the disagreement among the committee members is calculated using an
exponential weighting scheme. However, unlike previously proposed committees, our committee is an infinite
set, and the choice of the point to be queried is randomized.



5 Experimental results 

We implemented UPAL, along with the standard passive learning (PL) algorithm, and a variant of UPAL
called RAL (short for random active learning), all using logistic loss, in MATLAB. The choice of logistic
loss was motivated by the fact that BMAL (introduced below) was designed for logistic loss. Our MATLAB
code was vectorized to the maximum possible extent so as to be as efficient as possible. RAL is similar to
UPAL, but in each round samples a point uniformly at random from the currently unqueried pool. However,
it does not use importance weights to calculate an estimate of the risk of the classifier. The purpose of
implementing RAL was to demonstrate the potential effect of using unbiased estimators, and to check if the
strategy of randomly querying points helps in active learning.
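
To make the role of importance weights concrete, the following is a minimal sketch, not the authors' MATLAB implementation. It assumes a single query round in which pool point i is queried with probability probs[i], and compares the exact expectation of a one-sample risk estimate with and without the 1/(n * probs[i]) weight. The losses and probabilities are hypothetical.

```python
def pool_risk(losses):
    """Average loss over the whole pool (the quantity we want to estimate)."""
    return sum(losses) / len(losses)

def expected_weighted_estimate(losses, probs):
    """Exact expectation of the importance-weighted one-sample estimate.

    Point i is queried with probability probs[i] and, when queried,
    contributes losses[i] / (n * probs[i]).
    """
    n = len(losses)
    return sum(p * (loss / (n * p)) for loss, p in zip(losses, probs))

def expected_unweighted_estimate(losses, probs):
    """Exact expectation of the one-sample estimate without importance weights."""
    return sum(p * loss for loss, p in zip(losses, probs))

losses = [0.1, 0.4, 2.0, 0.7]   # hypothetical per-point losses
probs = [0.1, 0.2, 0.6, 0.1]    # hypothetical non-uniform query probabilities
```

Under these assumptions the weighted estimate has expectation equal to the pool risk for any strictly positive query distribution, whereas the unweighted estimate is pulled toward the losses of the points that are queried most often.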



We also implemented a batch mode active learning algorithm introduced by Hoi et al. (2006), which we
shall call BMAL. Hoi et al. showed in their paper superior empirical performance of BMAL over other
competing pool based active learning algorithms, and this is the primary motivation for choosing BMAL
as a competitor pool AL algorithm in this paper. BMAL, like UPAL, also proceeds in rounds, and in each
iteration selects k examples by minimizing the Fisher information ratio between the current unqueried pool
and the queried pool. However, a point once queried by BMAL is never requeried. In order to tackle the
high computational complexity of optimally choosing a set of k points in each round, the authors suggested
a monotonic submodular approximation to the original Fisher ratio objective, which is then optimized by
a greedy algorithm. At the start of round t + 1, when BMAL has already queried t points in the previous
rounds, in order to decide which point to query next, BMAL has to calculate for each potential new query a
dot product with all the remaining unqueried points. Such a calculation, when done for all possible potential
new queries, takes $O(n^2 t)$ time. Hence if our budget is B, then the total computational complexity of BMAL
is $O(n^2 B^2)$. Note that this calculation does not take into account the complexity of solving an optimization
problem in each round after having queried a point. In order to further reduce the computational complexity
of BMAL in each round we further restrict our search, for the next query, to a small subsample of the
current set of unqueried points. We set the value of p m i n in step 3 of algorithm 1 to i In order to avoid 
numerical problems we implemented a regularized version of UPAL where the term $\lambda \|w\|^2$ was added to the
optimization problem shown in step 11 of Algorithm 1. The value of $\lambda$ is allowed to change as per the current
importance weight of the pool. The optimal value of C in VW¹ was chosen via 5-fold cross-validation,
and by eyeballing for the value of C that gave the best cost-accuracy trade-off. We ran all our experiments
on the MNIST dataset (3 vs. 5)² and datasets from the UCI repository, namely Statlog, Abalone, and Whitewine.
Figure 1 shows the performance of all the algorithms on the first 300 queried points. On the MNIST
dataset, on average, the performance of BMAL is very similar to UPAL, and there is a noticeable gap
in the performance of BMAL and UPAL over PL, VW and RAL. Similar results were also seen in the case
of the Statlog dataset, though towards the end the performance of UPAL slightly worsens when compared to
BMAL. However, UPAL is still better than PL, VW, and RAL.



1 The parameters initial_t, l were set to a default value of 10 for all of our experiments.
2 The dataset can be obtained from http://cs.nyu.edu/~roweis/data.html. We first performed PCA to reduce the
dimensions to 25 from 784.
(d) Whitewine



Figure 1: Empirical performance of passive and active learning algorithms. The x-axis represents the number 
of points queried, and the y-axis represents the test error of the classifier. The subsample size for approximate 
BMAL implementation was fixed at 300. 



Sample size    UPAL              BMAL
               Time    Error     Time      Error
1200           65      7.27      60        5.67
2400           100     6.25      152       6.05
4800           159     6.83      295       6.25
10000          478     5.85      643.17    5.85

Table 1: Comparison of UPAL and BMAL on the MNIST dataset for varying training set sizes, with the budget fixed at
300. The error rate is in percentage, and the time is in seconds.
Budget    UPAL              BMAL               Speedup
          Time    Error     Time     Error
500       859     5.79      1973     5.33      2.3
1000      1919    6.43      7505     5.70      3.9
2000      4676    5.82      32186    5.59      6.9

Table 2: Comparison of UPAL and BMAL on the entire MNIST dataset for varying budget size. All times are in seconds
unless stated otherwise, and error rates are in percentage.



Active learning is not always helpful, and the success of AL depends on the match between the
marginal distribution and the hypothesis class. This is clearly reflected in Abalone, where the performance
of PL is better than UPAL at least in the initial stages and is never significantly worse. UPAL is uniformly
better than BMAL, though the difference in error rates is not significant. However, the performance of RAL
and VW is significantly worse. Similar results were also seen in the case of the Whitewine dataset, where PL
outperforms all AL algorithms. UPAL is better than BMAL most of the time. Even here one can witness
a huge gap in the performance of VW and RAL compared to PL, BMAL and UPAL.

One can conclude that VW, though computationally efficient, has a higher error rate for the same number
of queries. The uniformly poor performance of RAL signifies that querying uniformly at random does not
help. On the whole UPAL and BMAL perform equally well, and we show via our next set of experiments
that UPAL has significantly better scalability, especially when one has a relatively large budget B.



5.1 Scalability results 

Each round of UPAL takes $O(n)$ plus the time to solve the optimization problem shown in step 11 in
Algorithm 1. A similar optimization problem is also solved in BMAL. If the cost of solving this
optimization problem in step $t$ is $c_{opt,t}$, then the complexity of UPAL is $O(nT + \sum_{t=1}^{T} c_{opt,t})$, while BMAL
takes $O(n^2 B^2 + \sum_{t=1}^{B} c'_{opt,t})$, where $c'_{opt,t}$ is the complexity of solving the optimization problem in BMAL in
round $t$. For the approximate implementation of BMAL that we described, if the subsample size is $|S|$, then
the complexity is $O(|S|^2 B^2 + \sum_{t=1}^{B} c'_{opt,t})$.
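
The quadratic dependence of BMAL on the budget follows from summing the per-round query cost stated in Section 5. As a short worked step, assume the cost of choosing the query in round $t + 1$ is at most $\kappa n^2 t$, where $\kappa$ is a constant introduced here purely for illustration:

```latex
\[
  \sum_{t=0}^{B-1} \kappa\, n^2 t
  \;=\; \kappa\, n^2 \,\frac{B(B-1)}{2}
  \;=\; O(n^2 B^2).
\]
```

The same sum with $n$ replaced by the subsample size $|S|$ gives the $O(|S|^2 B^2)$ term of the approximate implementation. In both cases the cost of the per-round optimization problem is excluded.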

In our first set of experiments we fix the budget B to 300, and calculate the test error and the combined
training and testing time of both BMAL and UPAL for varying sizes of the training set. All the experiments
were performed on the MNIST dataset. Table 1 shows that with increasing sample size UPAL tends to be
more efficient than BMAL, though the gain in speed that we observed was at most a factor of 1.8.

In the second set of scalability experiments we fixed the training set size to 10000, and studied the effect
of increasing budget. We found that with increasing budget size the speedup of UPAL over BMAL
increases (Table 2). In particular, when the budget was 2000, UPAL was approximately 7 times faster than
BMAL. All our experiments were run on a dual core machine with 3 GB memory.
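
The speedup figures can be recomputed directly from the running times reported in Table 2. The short sketch below only does that arithmetic; the variable and function names are ours.

```python
# Running times in seconds, copied from Table 2 (budget -> (UPAL, BMAL)).
table2_times = {500: (859, 1973), 1000: (1919, 7505), 2000: (4676, 32186)}

def speedup(upal_seconds, bmal_seconds):
    """Ratio of BMAL running time to UPAL running time."""
    return bmal_seconds / upal_seconds

speedups = {budget: round(speedup(u, b), 1) for budget, (u, b) in table2_times.items()}
for budget, ratio in sorted(speedups.items()):
    print(f"budget={budget:5d}  speedup={ratio}")
```

The rounded ratios reproduce the Speedup column of Table 2.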



6 Conclusions and Discussion 

In this paper we proposed the first unbiased pool based active learning algorithm, and showed its good 
empirical performance and its ability to scale both with higher budget constraints and large dataset sizes. 
Theoretically we proved that when the true hypothesis is a linear hypothesis, we are able to recover it with 
high probability. In our view an important extension of this work would be to establish tighter bounds on the 
excess risk. It should be possible to provide upper bounds on the excess risk in expectation which are much 
sharper than our current high probability bounds. Another theoretically interesting question is to calculate 
how many unique queries are made after T rounds of UPAL. This problem is similar to calculating the number 
of non-empty bins in the balls-and-bins model commonly used in the field of randomized algorithms (Motwani
and Raghavan, 1995), when there are n bins and T balls, with the different points in the pool being the
bins, and the process of throwing a ball in each round being equivalent to querying a point in each round.
However, since each round is, unlike standard balls-and-bins, dependent on the previous round, we expect the
analysis to be more involved than a standard balls-and-bins analysis.
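
For reference, the standard model has a closed form. The sketch below assumes T balls thrown independently and uniformly at random into n bins, which is exactly the independence that UPAL's rounds lack, so it is a baseline and not an analysis of UPAL. The function name is ours.

```python
def expected_nonempty_bins(n_bins, n_balls):
    """Expected number of non-empty bins after throwing n_balls balls
    independently and uniformly at random into n_bins bins."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    # A fixed bin stays empty with probability (1 - 1/n)^T.
    prob_bin_empty = (1.0 - 1.0 / n_bins) ** n_balls
    # Linearity of expectation over the n bins.
    return n_bins * (1.0 - prob_bin_empty)
```

In this idealized model the expected number of unique queries after T rounds is n(1 - (1 - 1/n)^T).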

References 

N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In ICML, 1998. 

E.B. Baum and K. Lang. Query learning can work poorly when a human oracle is used. In IJCNN, 1992. 

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009. 

N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univ Press, 2006. 

W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng. Unbiased online active learning in data streams. 
In SIGKDD, 2011. 

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 
1994. 

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS, 2007. 

Patrick Flaherty, Michael I. Jordan, and Adam P. Arkin. Robust design of biological experiments. In Neural 
Information Processing Systems, 2005. 

Y. Guo and R. Greiner. Optimistic active learning using mutual information. In IJCAI, 2007. 

S.C.H. Hoi, R. Jin, J. Zhu, and M.R. Lyu. Batch mode active learning and its application to medical image 
classification. In ICML, 2006. 

R.A. Horn and C.R. Johnson. Matrix analysis. Cambridge Univ Press, 1990. 

D. Hsu, S.M. Kakade, and T. Zhang. An analysis of random design linear regression. Arxiv preprint
arXiv:1106.2363, 2011a.

D. Hsu, S.M. Kakade, and T. Zhang. Dimension-free tail inequalities for sums of random matrices. Arxiv
preprint arXiv:1104.1672, 2011b.

J. Langford, L. Li, A. Strehl, D. Hsu, N. Karampatziakis, and M. Hoffman. Vowpal wabbit, 2011.

D.D. Lewis and W.A. Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994.

A.E. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann. Smallest singular value of random matrices
and geometry of random polytopes. Advances in Mathematics, 195(2):491-523, 2005.

A.K. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In 
ICML, 1998. 

Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1st edition, 
August 1995. 

J. Quinonero, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence. Dataset shift in machine learning, 2008.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Neural Information 
Processing Systems, 2007. 

V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. 
Proceedings of the National Academy of Sciences, 105(36):13212, 2008. 




B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 
2008. 

Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of 
Wisconsin-Madison, 2009. 

H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In COLT, pages 287-294. ACM, 1992. 

O. Shamir. A variant of Azuma's inequality for martingales with subgaussian tail. Arxiv preprint
arXiv:1110.2392, 2011.

S. Tong and E. Chang. Support vector machine active learning for image retrieval. In Proceedings of the
ninth ACM international conference on Multimedia, 2001.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Arxiv preprint arXiv:1004.4389, 2010.

Sara van de Geer. Empirical processes in m-estimation. 2000. 

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. 
Annals of Statistics, 32(1), 2004. 

T. Zhang and F. Oles. The value of unlabeled data for classification problems. In ICML, 2000. 

Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning 
using gaussian fields and harmonic functions. In ICML, 2003. 



A Some results from random matrix theory 



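Theorem 4. (Quadratic forms of subgaussian random vectors (Litvak et al., 2005; Hsu et al., 2011a)) Let
$A \in \mathbb{R}^{m \times n}$ be a matrix, and $H = AA^T$, and $r = (r_1, \dots, r_n)$ be a random vector such that for some $\sigma \ge 0$,

$$E[\exp(a^T r)] \le \exp\Big(\frac{\|a\|^2 \sigma^2}{2}\Big)$$

for all $a \in \mathbb{R}^n$ almost surely. For all $\delta \in (0, 1)$,

$$P\Big[\|Ar\|^2 > \sigma^2\, \mathrm{tr}(H) + 2\sigma^2 \sqrt{\mathrm{tr}(H^2) \ln(1/\delta)} + 2\sigma^2 \|H\| \ln(1/\delta)\Big] \le \delta.$$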



The above theorem was first proved without explicit constants by Litvak et al. (2005). Hsu
et al. (2011a) established a version of the above theorem with explicit constants.



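Theorem 5. (Eigenvalue bounds of a sum of rank-1 matrices) Let $r_1, \dots, r_n$ be random vectors in $\mathbb{R}^d$ such
that, for some $\gamma > 0$,

$$E[r_i r_i^T \mid r_1, \dots, r_{i-1}] = I,$$
$$E[\exp(a^T r_i) \mid r_1, \dots, r_{i-1}] \le \exp(\|a\|^2 \gamma / 2) \quad \forall a \in \mathbb{R}^d.$$

For all $\delta \in (0, 1)$,

$$P\Big[\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^{n} r_i r_i^T\Big) > 1 + 2\epsilon_{\delta,n} \;\vee\; \lambda_{\min}\Big(\frac{1}{n}\sum_{i=1}^{n} r_i r_i^T\Big) < 1 - 2\epsilon_{\delta,n}\Big] \le \delta,$$

where

$$\epsilon_{\delta,n} = \gamma \Big(\sqrt{\frac{32(d \ln(5) + \ln(2/\delta))}{n}} + \frac{2(d \ln(5) + \ln(2/\delta))}{n}\Big).$$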
We shall use the above theorem in Lemma 3 and Lemma 2.

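Theorem 6. (Matrix Bernstein bound) Let $X_1, \dots, X_n$ be symmetric valued random matrices. Suppose there
exist $b, \sigma$ such that for all $i = 1, \dots, n$

$$E_i[X_i] = 0,$$
$$\lambda_{\max}(X_i) \le b,$$
$$\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^{n} E_i[X_i^2]\Big) \le \sigma^2$$

almost surely, then

$$P\Big[\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) > \sqrt{\frac{2\sigma^2 \ln(d/\delta)}{n}} + \frac{b \ln(d/\delta)}{3n}\Big] \le \delta. \tag{77}$$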



A dimension free version of the above inequality was proved in Hsu et al. (2011b). Such dimension
free inequalities are especially useful in infinite dimensional spaces. Since we are working in finite dimensional
spaces, we shall stick to the non-dimension free version.



Theorem 7. (Shamir, 2011) Let $(Z_1, \mathcal{F}_1), \dots, (Z_T, \mathcal{F}_T)$ be a martingale difference sequence, and suppose
there are constants $b \ge 1$, $c_t > 0$ such that for any $t$ and any $a > 0$,

$$\max\{P[Z_t > a \mid \mathcal{F}_{t-1}],\; P[Z_t < -a \mid \mathcal{F}_{t-1}]\} \le b \exp(-c_t a^2).$$

Then for any $\delta > 0$, with probability at least $1 - \delta$ we have



1 T 



< 



*=i 



/2861n(l/<5) 



The above result was first proved by Shamir (2011). Shamir proved the result for the case when
$c_1 = \dots = c_T$. Essentially one can use the same proof with obvious changes to get the above result.



Lemma 8 (Hoeffding's lemma). (see Cesa-Bianchi and Lugosi, 2006, page 359) Let $X$ be a random variable
with $a \le X \le b$. Then for any $s \in \mathbb{R}$,

$$E[\exp(sX)] \le \exp\Big(sE[X] + \frac{s^2 (b - a)^2}{8}\Big). \tag{78}$$
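
As a quick numerical sanity check of Hoeffding's lemma, the sketch below evaluates both sides for a two-point random variable. The range [-2, 2] mirrors the bound on the noise used in the proof above; the probability 0.3 and the values of s are arbitrary choices made here for illustration.

```python
import math

def mgf_two_point(s, a, b, prob_b):
    """E[exp(sX)] for X equal to b with probability prob_b and a otherwise."""
    return (1.0 - prob_b) * math.exp(s * a) + prob_b * math.exp(s * b)

def hoeffding_bound(s, a, b, mean):
    """Right-hand side of Hoeffding's lemma: exp(s E[X] + s^2 (b - a)^2 / 8)."""
    return math.exp(s * mean + s * s * (b - a) ** 2 / 8.0)

a, b, prob_b = -2.0, 2.0, 0.3            # hypothetical bounded variable
mean = (1.0 - prob_b) * a + prob_b * b   # E[X]
checks = [
    mgf_two_point(s, a, b, prob_b) <= hoeffding_bound(s, a, b, mean)
    for s in (-1.5, -0.5, 0.5, 1.5)
]
```

Every entry of `checks` is true, as the lemma guarantees for any bounded random variable.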



Theorem 8. Let $A$, $B$ be positive semidefinite matrices. Then

$$\lambda_{\max}(A) + \lambda_{\min}(B) \le \lambda_{\max}(A + B) \le \lambda_{\max}(A) + \lambda_{\max}(B).$$

The above inequalities are called Weyl's inequalities (see Horn and Johnson, 1990, chap. 3).
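
Weyl's inequalities can be checked by hand on small examples. The sketch below uses the closed-form eigenvalues of a symmetric 2x2 matrix; the two positive semidefinite matrices are hypothetical and chosen only so that the eigenvalues are easy to verify.

```python
import math

def sym2_eigenvalues(m):
    """(smallest, largest) eigenvalue of a symmetric 2x2 matrix [[p, q], [q, r]]."""
    p, q, r = m[0][0], m[0][1], m[1][1]
    center = (p + r) / 2.0
    radius = math.hypot((p - r) / 2.0, q)
    return center - radius, center + radius

def add2(m1, m2):
    """Entrywise sum of two 2x2 matrices."""
    return [[m1[i][j] + m2[i][j] for j in range(2)] for i in range(2)]

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3
B = [[1.0, 0.0], [0.0, 4.0]]   # eigenvalues 1 and 4
min_a, max_a = sym2_eigenvalues(A)
min_b, max_b = sym2_eigenvalues(B)
_, max_sum = sym2_eigenvalues(add2(A, B))
weyl_holds = max_a + min_b <= max_sum <= max_a + max_b
```

Here the largest eigenvalue of A + B lies between 3 + 1 and 3 + 4, as the theorem requires.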